
Add notebook for "Evaluating AI Search Engines with the judges Library" #270

Open · wants to merge 8 commits into main

Conversation

freddiev4

Description

This notebook showcases how to use judges—an open-source library for LLM-as-a-Judge evaluators—to assess and compare outputs from AI search engines like Gemini, Perplexity, and EXA.

This PR is a continuation of #257 -- shepherding the PR across!

What is judges?
judges is an open-source library that provides research-backed, ready-to-use LLM-based evaluators for assessing outputs across various dimensions such as correctness, quality, and harmfulness. It supports both:

  1. Classifiers (binary evaluations like True/False).
  2. Graders (scored evaluations on numerical scales).

The library also provides an integration with litellm, allowing access to most open- and closed-source models and providers.
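
For orientation, here is a minimal sketch of the classifier pattern described above, following the usage shown in the judges README; the exact import path, default model name, and `judge()` signature are assumptions to verify against the installed version and the notebook itself:

```python
# Minimal sketch of a binary (classifier) judge; import path and signature follow the
# judges README at the time of writing and may differ in newer releases.
from judges.classifiers.correctness import PollMultihopCorrectness

question = "What is the capital of France?"
answer = "Paris is the capital of France."
expected = "Paris"

# Classifiers return a True/False verdict plus the evaluator model's reasoning.
correctness = PollMultihopCorrectness(model="gpt-4o-mini")  # example model string
judgment = correctness.judge(input=question, output=answer, expected=expected)

print(judgment.score)      # True or False
print(judgment.reasoning)  # free-text explanation from the evaluator
```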

What This Notebook Does

  • Demonstrates how to use judges with litellm to evaluate AI search engine responses.
  • Uses Llama 3.3 (together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo) as the LLM evaluator to assess:
    • Correctness (factual accuracy).
    • Quality (clarity, helpfulness).
  • Provides a step-by-step workflow to evaluate outputs generated by search engines.
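
To make the litellm routing concrete, below is a hedged sketch of pointing a judge at the open model named above. The grader class name comes from the review discussion later in this thread; its import path and the exact constructor and `judge()` arguments are assumptions to check against the notebook:

```python
# Sketch only: route the evaluator through litellm to an open model hosted on Together AI.
# MTBenchChatBotResponseQuality is named in the review discussion; the import path is assumed.
from judges.graders.response_quality import MTBenchChatBotResponseQuality

# A response collected from one of the AI search engines being compared (placeholder text).
search_engine_response = (
    "Green tea contains catechins and L-theanine, which are associated with "
    "antioxidant effects and improved focus."
)

# Any litellm-supported "provider/model" string should be usable as the evaluator model.
quality = MTBenchChatBotResponseQuality(
    model="together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo"
)
judgment = quality.judge(
    input="What are the health benefits of green tea?",
    output=search_engine_response,
)

print(judgment.score)      # numeric score on the grader's scale
print(judgment.reasoning)
```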

Open-Source Tools & Resources

Why This Notebook?

This notebook provides a practical example of using judges with an open-source model (Llama 3.3) to evaluate real-world AI outputs. It highlights the library's flexibility, ease of integration with litellm, and usefulness for benchmarking AI systems in a transparent, reproducible manner.

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks. (Powered by ReviewNB)

@merveenoyan (Collaborator)

I wish you hadn't made it into a new PR, it's harder to track our comments and following changes. Can you re-open your former PR and commit the files here to there instead so we see changes clearly?

@freddiev4 (Author) commented Jan 12, 2025

> I wish you hadn't made it into a new PR, it's harder to track our comments and following changes. Can you re-open your former PR and commit the files here to there instead so we see changes clearly?

👋 @merveenoyan sorry about that! All of the commits from that PR are the same in this one except the most recent one. James won’t be able to finish up that PR for us so I needed to make a new one to ensure it gets the attention it needs — please let me know how else I can help make this smoother.

I’m happy to copy over the comments from the previous PR as well if that helps!

Otherwise, I think the only other option would be to open a PR -on top- of the other one, but you would need to merge as a repo owner since the PR was made by James and not me.

@@ -0,0 +1,1680 @@
{
@merveenoyan (Collaborator) Jan 13, 2025

you can use notebook_login instead
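
For reference, the suggested helper is the standard Hugging Face Hub login widget; a minimal sketch, assuming the notebook currently pastes a token into the code:

```python
from huggingface_hub import notebook_login

# Opens an interactive login widget in the notebook instead of hard-coding an access token.
notebook_login()
```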



@@ -0,0 +1,1680 @@
{

@merveenoyan (Collaborator) Jan 13, 2025

you could explain the error or ask users to ignore imo, otherwise it's confusing

@merveenoyan (Collaborator) left a comment

I just left some nits, otherwise looks good! @stevhliu should review too

@@ -0,0 +1,1680 @@
{

@stevhliu (Member) Jan 13, 2025

"...research-backed evaluator prompts..."

@@ -0,0 +1,1680 @@
{

@stevhliu (Member) Jan 13, 2025

It may be easier to consume this content in table form:

| Judge | What | Why | Source | When to use |
|---|---|---|---|---|
| PollMultihopCorrectness | | | | |
| PrometheusAbsoluteCoarseCorrectness | | | | |
| MTBenchChatBotResponseQuality | | | | |

@stevhliu (Member) left a comment

Thanks, just a few more comments and then we can merge! 🤗

@@ -12,6 +12,7 @@ Check out the recently added notebooks:
- [Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU](fine_tuning_vlm_dpo_smolvlm_instruct)
- [Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU](multimodal_rag_using_document_retrieval_and_smol_vlm)
- [Fine-tuning SmolVLM with TRL on a consumer GPU](fine_tuning_smol_vlm_sft_trl)
- [Evaluating AI Search Engines with `judges` - the open-source library for LLM-as-a-judge evaluators](llm_judge_evaluating_ai_search_engines_with_judges_library)
A Member commented:

I'd put this notebook at the top of the list since it's the most recent, and then remove "Fine-tuning SmolVLM with TRL on a consumer GPU" to keep the list tidy
